Virtual try-on methods that use a mask-based image synthesis strategy can better retain clothing details when the warped clothing is fused with the human body. However, because the position and structure of the human body and the clothing are difficult to align during try-on, the result is prone to severe occlusion, which degrades the visual effect. To solve this occlusion problem, a U-Net based generator was proposed. In the generator, a cascaded spatial attention module and channel attention module were added to the U-Net decoder, thereby achieving cross-domain fusion between the local features of the warped clothing and the global features of the human body. Specifically, first, a convolutional network was used to predict the Thin Plate Spline (TPS) transformation, and the clothing was warped according to the target human body pose. Then, the dressed person representation and the warped clothing were input into the proposed generator, which produced a mask image of the corresponding clothing area and rendered an intermediate result. Finally, the mask synthesis strategy was used to composite the warped clothing with the intermediate result through the mask, giving the final try-on result. Experimental results show that the proposed method not only reduces occlusion but also enhances image details. Compared with the Characteristic-Preserving Virtual Try-On Network (CP-VTON) method, the images generated by the proposed method have an average Peak Signal-to-Noise Ratio (PSNR) that is 10.47% higher, an average Fréchet Inception Distance (FID) that is 47.28% lower, and an average Structural SIMilarity (SSIM) that is 4.16% higher.
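The final mask synthesis step can be illustrated with a minimal, dependency-free sketch. It assumes the per-pixel blend commonly used in CP-VTON-style pipelines, result = mask * warped_clothing + (1 - mask) * intermediate_result; the abstract does not state the exact formula, so this form, the function name `composite`, and the toy 2x2 grayscale values are illustrative assumptions rather than the authors' implementation.

```python
def composite(warped_cloth, rendered, mask):
    """Blend two same-sized grayscale images, given as nested lists of
    floats in [0, 1], pixel by pixel.

    Assumed formula (not confirmed by the abstract):
        out = mask * warped_cloth + (1 - mask) * rendered
    """
    return [
        [m * c + (1.0 - m) * r for c, r, m in zip(c_row, r_row, m_row)]
        for c_row, r_row, m_row in zip(warped_cloth, rendered, mask)
    ]


# Toy inputs: a bright "clothing" image, a darker "intermediate" render,
# and a mask that is 1 where clothing should show and 0 elsewhere.
warped_cloth = [[0.9, 0.9], [0.9, 0.9]]
rendered = [[0.2, 0.2], [0.2, 0.2]]
mask = [[1.0, 0.0], [0.5, 0.0]]

result = composite(warped_cloth, rendered, mask)
```

Where the mask is 1 the output copies the warped clothing pixel, which is how clothing detail is preserved; where it is 0 the output keeps the rendered intermediate pixel, and fractional mask values mix the two.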